Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering (Appendix)
We chose the Google Search corpus [Luo et al., 2021] for our question-answering system because it provides good coverage of the required knowledge and is publicly available. Nevertheless, it is advised to conduct an ethical review prior to deploying the system in live service. Table 1 shows the data statistics of the OK-VQA dataset. We build a DPR retriever as a baseline for FLMR. Inner product search (supported by FAISS [Johnson et al., 2019]) is used to train and … In answer generation, we use t5-large and Salesforce/blip2-flan-t5-xl. (Equally contributed as the first author. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).)
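As a rough illustration of the inner-product search step, the sketch below performs the brute-force computation that FAISS's exact inner-product index accelerates. The embedding dimension, corpus size, and random data are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative sizes only; the paper's actual embedding dimension and
# corpus size may differ.
rng = np.random.default_rng(0)
dim = 128
doc_embeddings = rng.standard_normal((1000, dim)).astype("float32")  # corpus
query = rng.standard_normal((dim,)).astype("float32")

scores = doc_embeddings @ query    # inner product with every document
top_k = np.argsort(-scores)[:5]    # indices of the 5 highest-scoring documents
```

In practice one would build a `faiss.IndexFlatIP(dim)` over `doc_embeddings` and call `index.search(query, k)` to obtain the same top-k results efficiently.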
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework for tackling KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations of RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) similarity scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained similarities. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transform, using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions with multi-dimensional embeddings to capture finer-grained similarities between queries and documents.
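The multi-dimensional (token-level) scoring described above follows the late-interaction pattern: each query token embedding is matched against its best-scoring document token embedding, and these maxima are summed. The sketch below shows this MaxSim computation in NumPy; the token counts and embedding dimension are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def late_interaction_score(query_tokens, doc_tokens):
    """Late-interaction relevance score: for each query token embedding,
    take the maximum similarity over all document token embeddings (MaxSim),
    then sum over query tokens."""
    sim = query_tokens @ doc_tokens.T   # (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum()        # MaxSim per query token, then sum

# Illustrative shapes: 4 query tokens (e.g. text + visual tokens),
# 32 document tokens, 128-dim embeddings.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 128))
d = rng.standard_normal((32, 128))
score = late_interaction_score(q, d)
```

Because each query token contributes its own best match, the score stays sensitive to fine-grained overlaps (e.g. a single visual token matching one phrase in a document) that a single pooled embedding would wash out.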